Microsoft Malware Detection (Log Loss ~0.022)

Table of Contents

  1. Business/Real-world Problem
    1.1. What is Malware
    1.2. Problem Statement
    1.3. Source/Useful Links
    1.4. Real-world/Business objectives and constraints

  2. Machine Learning Problem
    2.1. Data
    2.2. Mapping the real-world problem to an ML problem
    2.3. Train and Test Dataset
    2.4. Useful blogs, videos and reference papers

  3. Exploratory Data Analysis
    3.1. Distribution of malware classes in whole data set
    3.2. Feature extraction
    3.3. Train Test split

  4. Machine Learning Models
    4.1. Machine Learning Models on bytes files
    4.2. Modeling with .asm files
    4.3. Modeling with .asm pixel intensity files
    4.4. Modeling with bigram byte files
    4.5. Machine Learning models on features of both .asm and .bytes files

  5. Conclusion and Model Comparison

1. Business/Real-world Problem

1.1. What is Malware?

The term malware is a contraction of malicious software. Put simply, malware is any piece of software that was written with the intent of doing harm to data, devices or to people.
Source: https://www.avg.com/en/signal/what-is-malware

1.2. Problem Statement

In the past few years, the malware industry has grown so rapidly that its syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and stop these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.